++++++ This notebook will mainly be used for the Coursera Capstone Project ++++++

Capstone Project Coursera for Data Science: The battle of neighborhoods or the battle of "organic Food & Beverages" at the close vicinity of Berlin metro stations

IBM Data Science Course, attendee Dr.B.Bayer, June 2021

Introduction

Background Information

Berlin is the capital of Germany with about 3.7 million inhabitants and a city full of movement. On average, every Berliner travels more than three times a day. Local public transport plays a special role in this. Approximately 50 percent of Berlin's households are car-free, and with 324 cars per 1,000 inhabitants, the city has the lowest motorization rate in Germany. In addition, even people who do have a car often use other means of transportation. Public transportation also plays a central role for commuting to work. For example, about 40% of all commuters use public transportation. [1]

Another strongly growing trend is the demand for organic products and, with it, the desire for a healthy diet [2]. Train and metro stations already offer numerous options for buying food, with many small and large stores selling food and drinks to take away. However, these offerings are very often not considered healthy and thus run counter to the desire for a healthy diet. The market for healthy take-away products is still relatively small, which leaves a large potential market for investors.

Problem Statement

Given this prospect, various stakeholders (entrepreneurs, investors) may be interested in exploring business opportunities for organic food and beverage (F&B) shops in the very close vicinity of Berlin's metro stations. This data science project is carried out to help them answer the following question: Which of Berlin's metro stations are strategic locations for opening an "organic F&B" business?

Data

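To explore potential answers to this problem, the following data are required: the names and geographical coordinates of Berlin's metro stations, scraped from Wikipedia [3], and the venues (name, category, latitude, longitude) around each station, retrieved through the Foursquare API [4].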

Methodology

This section presents the main components of the report. It starts with data extraction (web scraping) of Berlin's metro stations and retrieval of their geographical coordinates. Using the Foursquare API, these coordinates are then given as inputs to explore the venues around the stations.

One-hot encoding is performed to analyze and narrow down the most common venues at each station. Given all the venues surrounding them, the stations are clustered using the K-means algorithm. The optimal number of clusters is chosen using the elbow method and the silhouette score. Each cluster is analyzed separately to identify one discriminating venue category that characterizes it. Analysis and visualization of the clusters then give insights into which regions are strategic for setting up the business.

The following cell contains all the necessary Python libraries.

Two main dataframes will be created for use in the analysis:

Web Scraping: Berlin metro stations and coordinates

The data to scrape are the names of all Berlin metro (U-Bahn) stations and their corresponding geographical coordinates. We first need to specify the URLs of the webpages to which we will send a GET request. For reference, Berlin's metro stations are listed on Wikipedia [3].

Data Cleaning

Keep only columns Station name (incl. coordinates) and locality

Some more cleaning

Geographical coordinates conversion
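The conversion step could look roughly like the following sketch. It assumes the scraped coordinates arrive as degrees/minutes/seconds strings such as `52° 31′ 12″ N`, which the source does not confirm; the function name `dms_to_decimal` and the sample values are hypothetical.

```python
import re

# Assumed input format (not confirmed by the source): "52° 31′ 12″ N".
# Groups: degrees, minutes, seconds, hemisphere letter.
_DMS_PATTERN = re.compile(
    r"(\d+)\s*°\s*(\d+)\s*[′']\s*(\d+(?:\.\d+)?)\s*[″\"]\s*([NSEOW])"
)


def dms_to_decimal(text: str) -> float:
    """Convert one degrees/minutes/seconds string to signed decimal degrees."""
    match = _DMS_PATTERN.search(text)
    if match is None:
        raise ValueError(f"unrecognized coordinate: {text!r}")
    degrees, minutes, seconds, hemisphere = match.groups()
    value = int(degrees) + int(minutes) / 60 + float(seconds) / 3600
    # South and west are negative; "O" (German "Ost") is treated as east.
    return -value if hemisphere in ("S", "W") else value


# Hypothetical sample values, roughly in central Berlin.
lat = dms_to_decimal("52° 31′ 12″ N")
lon = dms_to_decimal("13° 24′ 36″ O")
```

If the page already exposes decimal coordinates, this step reduces to a plain `float()` conversion.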

Plot a nice map to show the stations

A total of 178 metro stations are present and taken into consideration.

Note the gap in metro stations in the eastern part of the city. This is due to the former division of the city: the eastern part is still not very well developed with respect to public transport by metro.

map_tramstations.png

Foursquare venue data

Before exploring the venues using Foursquare API, credentials and version must first be defined.

In the publication version the credentials were removed.

Define a function that searches within a radius of 100 m around each station's coordinates, with a venue limit of 200.

The function will return a dataframe containing the venues within the defined radius of a region (i.e., a subdistrict), with the following details: venue name, venue category, venue latitude, venue longitude. The inputs to be provided are the names of the city, district, and subdistrict, as well as the latitudes and longitudes.
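A minimal sketch of such a function's two building blocks is given here: constructing the request URL and flattening a response into rows. It assumes the legacy Foursquare v2 `venues/explore` endpoint and its `response -> groups -> items -> venue` layout; the source does not show which endpoint the notebook called. The helper names, the placeholder credentials, and the sample payload are hypothetical, and no network request is made.

```python
from urllib.parse import urlencode

# Assumption: legacy Foursquare v2 "explore" endpoint.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, client_id, client_secret, version,
                      radius=100, limit=200):
    """Build the request URL for one station (radius in metres)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"


def extract_venues(station, payload):
    """Flatten a response into (station, name, category, lat, lng) rows."""
    rows = []
    for group in payload.get("response", {}).get("groups", []):
        for item in group.get("items", []):
            venue = item["venue"]
            # A venue may come without any category; fall back to None.
            categories = venue.get("categories") or [{}]
            rows.append((
                station,
                venue["name"],
                categories[0].get("name"),
                venue["location"]["lat"],
                venue["location"]["lng"],
            ))
    return rows


# Placeholder credentials and a hand-written payload, for illustration only.
url = build_explore_url(52.52, 13.41, "CLIENT_ID", "CLIENT_SECRET", "20210601")
sample_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Example Bakery",
               "categories": [{"name": "Bakery"}],
               "location": {"lat": 52.5201, "lng": 13.4102}}},
]}]}}
rows = extract_venues("Example Station", sample_payload)
```

The rows for all stations can then be collected into one pandas dataframe with the four venue columns described above.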

Apply the function and save the results in a pandas dataframe

(This step may need several minutes.)

Some unwanted categories, such as "Metro Station" or "Bus Stop", are removed. These are mainly categories that have nothing to do with food supply.
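As an illustration, this kind of filtering can be sketched with pandas as follows. The column names and the sample rows are assumptions, and the real list of excluded categories is longer than the two named in the text.

```python
import pandas as pd

# Hypothetical sample of the venue dataframe.
venues = pd.DataFrame({
    "Station": ["A", "A", "B", "B"],
    "Venue Category": ["Bakery", "Metro Station", "Bus Stop", "Café"],
})

# Categories unrelated to food supply; compared in lower case so that
# "Bus Stop" and "bus stop" are treated alike.
unwanted = {"metro station", "bus stop"}

is_unwanted = venues["Venue Category"].str.lower().isin(unwanted)
food_venues = venues[~is_unwanted].reset_index(drop=True)
```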

Plot for eating places

Various kinds of food places top the list of most common venues around Berlin's metro stations. Organic and vegetarian food and beverages are not found among the top categories.

Determine Top 5 venues for each station

One Hot encoding

One-hot encoding will help to convert categorical variables (i.e., venues) into numeric variables. In this case, I will take the mean of the frequency of venue occurrence within a station.
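The encoding and averaging can be sketched as follows, assuming a venue dataframe with `Station` and `Venue Category` columns (hypothetical names and sample data). Each station ends up with one row holding the share of its venues that fall into each category.

```python
import pandas as pd

# Hypothetical sample: station A has three venues, station B has two.
venues = pd.DataFrame({
    "Station": ["A", "A", "A", "B", "B"],
    "Venue Category": ["Bakery", "Bakery", "Café", "Café", "Wine Shop"],
})

# One indicator column per venue category (1.0 where the venue has it).
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Station", venues["Station"])

# Mean of the indicators per station = relative frequency of each category.
station_profile = onehot.groupby("Station").mean()
```

In this sample, station A gets 2/3 for Bakery and 1/3 for Café, while station B gets 0.5 each for Café and Wine Shop.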

The following cell contains a function that sorts the venues of each station. In this analysis, the 5 most common venue categories of each station are taken into consideration.
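One possible shape for such a sorting function is sketched here. The function name, the sample frequencies, and the use of `top_n=2` in the example are assumptions for illustration; the analysis itself uses the top 5.

```python
import pandas as pd


def most_common_venues(row: pd.Series, top_n: int = 5) -> list:
    """Return the top_n category names of one station, most frequent first."""
    ranked = row.sort_values(ascending=False)
    return ranked.index[:top_n].tolist()


# Hypothetical per-station category frequencies (one row per station).
profile = pd.DataFrame(
    {"Bakery": [0.6, 0.0], "Café": [0.3, 0.5], "Wine Shop": [0.1, 0.5]},
    index=["A", "B"],
)

top2 = {
    station: most_common_venues(row, top_n=2)
    for station, row in profile.iterrows()
}
```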

Clustering stations by venue similarity

The stations will be clustered or segmented based on a set of similar characteristics or features, i.e., their surrounding venues. K-Means clustering, which is used in this part of the analysis, is a machine learning algorithm that creates homogeneous subgroups/clusters from unlabeled data such that data points in each cluster are as similar as possible to each other according to a similarity measure (e.g., Euclidean distance).

K-Means Clustering

Selecting the features (X): all venue category columns from the one-hot encoding dataframe.

Determination of k (Elbow method)

Before proceeding, a value of k (number of clusters) needs to be determined. The Elbow Method below calculates the sum of squared distances of data points to their closest centroid (cluster center) for different values of k. The optimal value of k is the one after which there is a plateau (no significant decrease in sum of squared distances).
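The elbow computation can be sketched with scikit-learn as follows. The data here are synthetic (three well-separated point clouds) and stand in for the station feature matrix, which is not shown in the source; `inertia_` is scikit-learn's name for the sum of squared distances to the closest centroid.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the feature matrix X: three well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

k_values = list(range(1, 7))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Sum of squared distances of all points to their closest centroid.
    inertias.append(model.inertia_)
```

Plotting `inertias` against `k_values` gives the elbow curve. On this synthetic data the drop flattens after k = 3, whereas real venue data may show no clear bend.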

Because there is no discernible "elbow" from the plot, another measure was applied: Silhouette Score.

The silhouette score ranges from -1 to 1. A value of 1 means the cluster is dense and well separated from other clusters. A value near 0 indicates overlapping clusters, with data points close to the decision boundary of neighboring clusters. A negative score indicates that samples might have been assigned to the wrong clusters.
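A sketch of how the silhouette score can be computed for several values of k with scikit-learn is given here. The data are again synthetic (three well-separated blobs) and only illustrate the mechanics, not the result for the Berlin stations.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the feature matrix X: three well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

scores = {}
for k in range(2, 7):  # the score is undefined for a single cluster
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest mean silhouette score.
best_k = max(scores, key=scores.get)
```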

In the plot above there is a peak at k=5, so I will proceed with that value as the optimal number of clusters. However, neither the elbow method nor the silhouette score gives a very clear result, and both need further investigation.

Visualizing Clusters

Now that each station has been assigned a cluster label, it is helpful to visualize the clusters on a map of Berlin to see how they are distributed. The Folium library is used for this purpose.

map_cluster_final.png

Examining Each Cluster

Each cluster is filtered from the dataframe previously created in the clustering stage. The clusters are analyzed separately in order to gain an understanding of the discriminating venues that characterize each of them. That is, the 1st and 2nd most common venue categories of each cluster are singled out.
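The per-cluster inspection can be sketched as a small pandas aggregation. The column names and the sample rows are assumptions; the sketch counts the members of each cluster and picks the category that appears most often as a station's 1st most common venue.

```python
import pandas as pd

# Hypothetical clustered station table.
stations = pd.DataFrame({
    "Station": ["A", "B", "C", "D"],
    "Cluster Label": [0, 0, 0, 1],
    "1st Most Common Venue": ["Bakery", "Bakery", "Café", "Wine Shop"],
})

summary = stations.groupby("Cluster Label")["1st Most Common Venue"].agg(
    members="count",
    # Category that is most often ranked first within the cluster.
    dominant_venue=lambda s: s.value_counts().idxmax(),
)
```

The same aggregation on the 2nd most common venue column gives the secondary category of each cluster.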

Cluster 0

Color code in map: wine red (or brown for some eyes)

Observation for Cluster 0: Bakeries are the prominent venue in this cluster, which contains 13 stations.

Cluster 1

Color code in map: dark blue

Observation for Cluster 1: With 3 member stations, this is the smallest cluster. It is dominated by Mexican restaurants and wine shops.

Cluster 2

Color code in map: brighter blue

Observation for Cluster 2: Thirteen stations fall into this cluster. It is dominated by Italian restaurants and wine shops.

Cluster 3

Color code in map: bright green

Observation for Cluster 3: This is the largest cluster, with 92 member stations. Coffee shops and cafés are prominent, followed by pizza and Turkish food.

Cluster 4

Color code in map: orange

Observation for Cluster 4: Fourteen stations fall into this cluster. It is dominated by Doner restaurants and, again, wine shops.

Results and Discussion

Exploratory data analysis as well as machine learning and visualization techniques have provided us with some insights into the problem at hand.

A total of 842 venues in 184 venue categories were returned for the areas around all 178 Berlin metro stations at the time the API call was made. The search radius was chosen to be quite narrow at 100 m. After removing venue categories not of interest for the food industry (such as the station itself, gyms, IT services, etc.), 85 unique categories remained. The most common categories overall are 1. Bakeries, 2. Doner restaurants, 3. Cafés, 4. Coffee Shops, and 5. Italian and Pizza places.

After deciding on an optimal k value of 5, the K-Means algorithm was run to cluster the stations based on their most common surrounding venues. To determine this optimal k value, two common methods, the elbow method and the silhouette score, were applied. The resulting value of k = 5 is ambiguous and needs further investigation.

Each of the five clusters, labeled 0-4, is characterized by dominant venues as follows:

| Cluster Label | Members | Common Venues |
|---|---|---|
| 0 | 13 | Bakeries |
| 1 | 3 | Mexican and Wine Shops |
| 2 | 13 | Italian and Wine Shops |
| 3 | 92 | Coffee/Cafe, Pizza, Turkish food |
| 4 | 14 | Doner restaurants and Wine Shops |

A considerable number of coffee shops and bakeries as well as Turkish food places and wine shops are present. Categories indicating "healthy" food, such as "vegan / vegetarian places", do not appear among the Top 5. In fact, such places are very rare, and it is therefore recommended that stakeholders look into opportunities at stations all over Berlin to start a business with organic food and beverages.

Conclusion

Stakeholders searching for opportunities to open an organic food and beverage business (incl. vegan / vegetarian dishes) may want to consider locations where competition is not severe. This study has shown that, in the very close proximity of Berlin's metro stations (radius of 100 m), such businesses hardly exist. These locations are therefore among the best candidates for an organic food and beverage business.

References

[1] https://www.cnb-online.de/hintergruende/zahlen-und-fakten-zum-oepnv/
[2] https://www.oekolandbau.de/handel/marktinformationen/europaeischer-bio-markt-waechst-auf-ueber-40-milliarden-euro/
[3] https://de.wikipedia.org/wiki/Liste_der_Berliner_U-Bahnh%C3%B6fe
[4] https://developer.foursquare.com/